Street Cleaning Function editted by Chang
Editted on 7th Feburary 2020
Following Changes were made:
The combined Street Dictionary had several issues regarding “STR”, “ST” and “STREET”. They have been resolved.
The Street Dictionary (combined and full) for Manhattan and Brooklyn had several abnormalities. From manual checking the issues that were identified included:
The following issues have not been resolved:
From the sample dataset, an output of 557,357 rows by 33 columns was derived.
There were 1,478 Enumeration Districts, 1,276 microfilms and 8,557 unique street names within the dataset.
The results from the Street Matching function were as follow:
The results from the Street Matching function were as follow:
Significant improvements observed in result types 1 and 2 after improving the Street Cleaning Function.
# Distribution of Result Type in Manhattan
plot(table(mn_output$result_type),
type = "h",
col = c("blue", "red", "orange", "purple", "green", "pink"),
lwd = 10,
main = "Result Type for Manhattan",
ylab = "Count",
xlab = "Result Type")
# Sample of Result Type 1 (Perfect Match)
mn_output %>% filter(result_type == 1) %>% select(ED, street_add, best_match, result_type) %>% head(1)
# Sample of Result Type 2 (Identical Match)
mn_output %>% filter(result_type == 2) %>% select(ED, street_add, best_match, result_type) %>% head(1)
# Sample of Result Type 3 (Singular Mode)
mn_output %>% filter(result_type == 3) %>% select(ED, street_add, best_match, result_type) %>% head(1)
# Sample of Result Type 4 (Multiple Modes)
mn_output %>% filter(result_type == 4) %>% select(ED, street_add, best_match, result_type) %>% head(1)
# Sample of Result Type 1 (NA)
mn_output %>% filter(result_type == 5) %>% select(ED, street_add, best_match, result_type) %>% head(3)
# Sample of Result Type 6 (no match)
mn_output %>% filter(result_type == 6) %>% select(ED, street_add, best_match, result_type) %>% head(4)
Out of the problematic Street Matches (result type 5 and type 6), there is a trend of some EDs being more problematic than others, i.e. more entries of 5 or 6 within certain Enumeration Districts.
# Problematic EDs for Result Type 5
# Treshold set to 50 (arbitrarily decided)
mn_output %>% filter(result_type == 5) %>%
select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(19)
## .
## 0815 0836 1272 0451 0818 0049 0142 0009 0363 0766 1000 1202 1643 0316 0557 0110
## 1432 226 153 113 112 100 100 92 88 85 78 74 74 71 64 54
## 0433 0291 0394
## 52 51 50
# Problematic EDs for Result Type 6
# Treshold set to 50 (arbitrarily decided)
mn_output %>% filter(result_type == 6) %>%
select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(8)
## .
## 1481 1913 0081 0026 0764 0782 0242 1941
## 522 248 113 88 84 83 74 67
# Average result_type by ED
average_result_type_mn <- mn_output %>%
select(ED, result_type) %>%
group_by(ED) %>%
summarise(mean_result_type = mean(result_type))
average_result_type_mn$ED <- as.numeric(average_result_type_mn$ED)
mn_result_type_plot <- ggplot(average_result_type_mn,
aes(x = ED, y = mean_result_type)) +
theme_classic() +
geom_point() +
labs(x = "ED", y = "Average Result Type", title = "Mean Result Type by ED (MN)"
) +
geom_vline(xintercept = 1471, color = "red", linetype = "dotted") +
geom_vline(xintercept = 1652, color = "red", linetype = "dotted")
ggplotly(mn_result_type_plot)
# Standard Deviation of result_type by ED
sd_result_type_mn <- mn_output %>%
select(ED, result_type) %>%
group_by(ED) %>%
summarise(sd_result_type = sd(result_type))
sd_result_type_mn$ED <- as.numeric(sd_result_type_mn$ED)
mn_sd_result_type_plot <- ggplot(sd_result_type_mn,
aes(x = ED, y = sd_result_type)) +
theme_classic() +
geom_point() +
labs(x = "ED", y = "Standard Deviation of Result Type", title = "Standard Deviation of Result Type by ED (MN)"
) +
geom_vline(xintercept = 1394, color = "red", linetype = "dotted") +
geom_vline(xintercept = 1652, color = "red", linetype = "dotted")
ggplotly(mn_sd_result_type_plot)
Out of the 28% of Perfect Matches (156k entries) 1,717 entries were matched via the Fill Down function 03_Matched_Street_Fill_Down - 1.1%.
Out of 557,357 entries, 159,669 entries were flagged for house number changes (i.e. 0 or 1) - 28.6%.
Previously, 7,350 entries were flagged with “1” meaning the house number was editted by the function. That has now grown to 7,520 (post street cleaning function edits).
This flag refers to: The initial household number might have been “195-7”, “34 TO 36” or “112 114”. There would be a split between the house numbers and hn_1 will be 195 and hn_2 will be 197 as in the first example.
# Problematic Enumeration Districts for House Number Cleaning
mn_output %>% filter(flag_hn_cleaned == 1) %>%
select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(24)
## .
## 0190 0984 0805 0843 0094 1267 0967 0212 1180 0024 0586 1271 0403 1269 1073 0509
## 634 289 253 244 191 187 177 172 147 144 133 102 101 85 81 79
## 0256 1036 0681 1170 0786 0924 0442 0431
## 74 73 63 63 56 56 52 51
Out of all the house hold number entries, 364,730 entries were filled in via the Fill Down function 05_House_Number_Fill_Down - 65.4%.
That has increased to 368,948 - 66.1% after the street cleaning function was editted.
From the sample dataset, an output of 371,833 rows by 33 columns was derived.
There were 1,106 Enumeration Districts, 1,527 microfilms and 7,076 unique street names within the dataset.
The results from the Street Matching function are as follow:
In contrast to Manhattan, there is a larger proportion of Perfect Matches.
# Distribution of Result type in Brooklyn
plot(table(bk_output$result_type),
type = "h",
col = c("blue", "red", "orange", "purple", "green", "pink"),
lwd = 10,
main = "Result Type for Brooklyn",
ylab = "Count",
xlab = "Result Type")
Out of the problematic Street Matches (result type 5 and type 6), there is a trend of some EDs being more problematic than others, i.e. more entries of 5 or 6 within certain Enumeration Districts.
# Problematic EDs for Result Type 5
# Treshold set to 50 (arbitrarily decided)
bk_output %>% filter(result_type == 5) %>%
select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(18)
## .
## 0915 0519 1007 0556 1038 0570 0477 0491 0562 0905 1028 0580 1026 0807 1408 0935
## 252 203 117 98 92 79 78 75 73 63 62 61 61 59 57 55
## 0154 0365
## 54 52
# Problematic EDs for Result Type 6
# Treshold set to 50 (arbitrarily decided)
bk_output %>% filter(result_type == 6) %>%
select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(10)
## .
## 0608 0161 0058 0253 0572 0909 0488 0733 0158 1073
## 110 99 88 83 78 69 60 56 55 51
# Average result_type by ED
average_result_type_bk <- bk_output %>%
select(ED, result_type) %>%
group_by(ED) %>%
summarise(mean_result_type = mean(result_type))
average_result_type_bk$ED <- as.numeric(average_result_type_bk$ED)
bk_result_type_plot <- ggplot(average_result_type_bk,
aes(x = ED, y = mean_result_type)) +
theme_classic() +
geom_point() +
labs(x = "ED", y = "Average Result Type", title = "Mean Result Type by ED (BK)"
) +
geom_vline(xintercept = 1103, color = "red", linetype = "dotted") +
geom_vline(xintercept = 1406, color = "red", linetype = "dotted")
ggplotly(bk_result_type_plot)
# Standard Deviation of result_type by ED
sd_result_type_bk <- bk_output %>%
select(ED, result_type) %>%
group_by(ED) %>%
summarise(sd_result_type = sd(result_type))
sd_result_type_bk$ED <- as.numeric(sd_result_type_bk$ED)
bk_sd_result_type_plot <- ggplot(sd_result_type_bk,
aes(x = ED, y = sd_result_type)) +
theme_classic() +
geom_point() +
labs(x = "ED", y = "Standard Deviation of Result Type", title = "Standard Deviation of Result Type by ED (BK)"
) +
geom_vline(xintercept = 1103, color = "red", linetype = "dotted") +
geom_vline(xintercept = 1406, color = "red", linetype = "dotted")
ggplotly(bk_sd_result_type_plot)
Out of the 76.5% of Perfect Matches (284k entries) 1,005 entries were matched via the Fill Down function 03_Matched_Street_Fill_Down - A much smaller percentage than Manhattan. This suggests that the recording of entries might be more accurate in Brooklynn or that the street directory is more well-developed. Nevertheless, errors are still present.
Out of 371,833 entries, 179,303 entries were flagged for house number changes (i.e. 0 or 1) - 48.2%.
2,563 entries were flagged with “1” meaning the house number was editted by the function. Brooklyn also has several abnormal initial house numbers. E.g. “222 1/2”, “192 3TH”. Otherwise, the issues are similar to Manhattan.
# Problematic Enumeration Districts for House Number Cleaning
bk_output %>% filter(flag_hn_cleaned == 1) %>%
select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(11)
## .
## 0614 0016 0013 0456 0321 0518 0029 0358 0003 0239 0419
## 322 156 145 109 103 102 80 80 77 77 68
Out of all the house hold number entries, 182,950 entries were filled in via the Fill Down function 05_House_Number_Fill_Down - 49.1%.
# Dan's function - Change ED's factor format to numeric
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
# Aggregate the mean of result type, i.e. street matching success rate
# The lower the better
mn_output_result_type <- aggregate(result_type ~ ED, FUN = mean, data = mn_output)
mn_output_result_type$ED <- as.numeric(mn_output_result_type$ED)
# Change Manhattan Shapefile ED variable to numeric format (Dan's function)
mn_map@data$ED <- as.numeric.factor(mn_map@data$ED)
# Left join the 2 datasets
mn_map@data <- left_join(mn_map@data, mn_output_result_type, by = c('ED' = 'ED'))
# Map for Result Type in Manhattan
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(mn_map) + tm_polygons("result_type")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
# Aggregate the mean of result type, i.e. street matching success rate
# The lower the better
bk_output_result_type <- aggregate(result_type ~ ED, FUN = mean, data = bk_output)
bk_output_result_type$ED <- as.numeric(bk_output_result_type$ED)
# Change Brooklyn Shapefile ED variable to numeric format
bk_map@data$ED <- as.numeric.factor(bk_map@data$ED)
# Left join the 2 datasets
bk_map@data <- left_join(bk_map@data, bk_output_result_type, by = c('ED' = 'ED'))
# Map for Result Type in Brooklyn
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(bk_map) + tm_polygons("result_type")
# Highlight 2 examples in Brooklyn with poor street name matches
bk_output %>% filter(ED == 1034 | ED == 1039)
# Missing EDs in Manhattan
# Manhattan ED list
mn_ed_list <- mn_map@data$ED %>% sort()
mn_output_ed <- mn_output$ED %>% unique() %>% as.numeric() %>% sort()
# Which EDs are in the Shapefile but missing from the Manhattan dataset
mn_ed_list[!(mn_output_ed %in% mn_ed_list)]
## [1] 43 47 1431 1519 1742 1743 1745
length(mn_ed_list[!(mn_output_ed %in% mn_ed_list)])
## [1] 7
# Which EDs are in the dataset but missing from the Manhattan shapefile
mn_output_ed[!(mn_ed_list %in% mn_output_ed)]
## [1] 699 1070 1175 1712 1727 1737 1745 1747 1748
length(mn_output_ed[!(mn_ed_list %in% mn_output_ed)])
## [1] 9
# Missing EDs in Brooklyn
# Brooklyn ED list
bk_ed_list <- bk_map@data$ED %>% sort()
# Brooklyn Output EDs
bk_output_ed <- bk_output$ED %>% unique() %>% as.numeric() %>% sort()
# Which EDs are in the Shapefile but missing from the Brooklyn dataset
bk_ed_list[!(bk_output_ed %in% bk_ed_list)]
## [1] 512
length(bk_ed_list[!(bk_output_ed %in% bk_ed_list)])
## [1] 1
# Which EDs are in the dataset but missing from the Brooklyn shapefile
bk_output_ed[!(bk_ed_list %in% bk_output_ed)]
## [1] 761 950 958 963 964 1040 1086
length(bk_output_ed[!(bk_ed_list %in% bk_output_ed)])
## [1] 7
There are 38,174 entries with NA best_match. Their street_add entries are also NA. There are 792 EDs with these problems.
na_mn_output <- mn_output %>% filter(is.na(best_match)) %>% select(ED) %>% unique()
na_mn_output <- na_mn_output %>% mutate('NA best match' = 1)
na_mn_map <- merge(mn_map, na_mn_output, 'ED', 'ED')
# Map for NA Street Address / Best Match in Manhattan
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(na_mn_map) + tm_polygons("NA best match")
# Another problematic best match is "0"
mn_output %>% filter(best_match == 0) %>% select(street_add) %>% unique()
There are 9,264 entries with NA best_match. Their street_add entries are also NA. There are 427 EDs with these problems.
na_bk_output <- bk_output %>% filter(is.na(best_match)) %>% select(ED) %>% unique()
na_bk_output <- na_bk_output %>% mutate('NA best match' = 1)
na_bk_map <- merge(bk_map, na_bk_output, 'ED', 'ED')
# Map for NA Street Address / Best Match in Brooklyn
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(na_bk_map) + tm_polygons("NA best match")
# Another problematic best match is "0"
bk_output %>% filter(best_match == 0) %>% select(street_add) %>% unique()